Best Practices & Common Pitfalls in Machine Learning
Data Biases & Sampling Issues
- Sample bias: Occurs when the sample is not representative of the population.
- Selection bias: Certain kinds of data are systematically included or excluded.
- Response bias: Labels or annotations are systematically influenced (e.g., survey bias).
- Real-world business examples:
- activity bias (social-media content)
- societal bias (human-generated content)
- selection bias (the model's own outputs re-enter the training data, forming a feedback loop)
Data Drift & Distribution Shift
- covariate drift:
- the distribution of the input (independent) variables, P(X), shifts while P(Y|X) stays the same
- label/prior drift:
- the distribution of the target variable, P(Y), shifts (the output distribution changes but, for a given output, the input distribution P(X|Y) stays the same)
- concept/posterior drift:
- the relationship between inputs and labels changes: the input distribution P(X) remains the same but the conditional distribution of the output given an input, P(Y|X), changes
- general data distribution shifts
- feature definition change
- label schema change
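A common way to monitor covariate drift in a single feature is a two-sample test between training-time and live values. A minimal sketch using `scipy.stats.ks_2samp` (the synthetic feature arrays and the 1% significance threshold are illustrative assumptions, not a universal rule):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)   # feature values at training time
production = rng.normal(loc=0.5, scale=1.0, size=1000)  # shifted feature values in production

# Kolmogorov-Smirnov test: are the two samples drawn from the same distribution?
stat, p_value = ks_2samp(reference, production)
drift_detected = p_value < 0.01  # reject "same distribution" at the 1% level
```

In practice this would run per feature on a schedule, with the threshold tuned to an acceptable false-alarm rate.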
Statistical Pitfalls
- Endogeneity
- = a situation where a predictor variable is correlated with the error term in a statistical model
- e.g. "the rich get richer, the poor get poorer": the outcome feeds back into the predictors
- Correlation vs. Causation
- Multicollinearity
- Underfitting vs. Overfitting
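Multicollinearity can be quantified with the variance inflation factor (VIF): regress each feature on the others and compute 1/(1 − R²); a common rule of thumb flags VIF > 10. A NumPy-only sketch (the toy data and the threshold are illustrative assumptions):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # regress feature j on the rest
        residuals = y - A @ beta
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                  # independent feature
X = np.column_stack([x1, x2, x3])
vifs = vif(X)  # large for x1 and x2, near 1 for x3
```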
Imbalanced Datasets
Why it matters
Accuracy becomes misleading when classes are imbalanced.
Solutions
Choose better metrics
- Choose other metrics for classification problems, such as precision, recall, F-score, balanced accuracy
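To see why accuracy misleads, consider a 90:10 dataset and a classifier that always predicts the majority class. A sketch with scikit-learn's metric functions (the data is made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

# 90:10 imbalance; the "classifier" always predicts the majority class 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)                    # 0.90, looks great
precision = precision_score(y_true, y_pred, zero_division=0) # 0.0 on the minority class
recall = recall_score(y_true, y_pred, zero_division=0)       # 0.0 on the minority class
f1 = f1_score(y_true, y_pred, zero_division=0)               # 0.0
balanced = balanced_accuracy_score(y_true, y_pred)           # 0.50, no better than chance
```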
Data-level methods: Resampling
- Undersampling: down-sampling the larger set
- by randomly throwing away some data from that set
- Oversampling: up-sampling the smaller set
- Direct duplication: making multiple copies of the data points in the smaller set (can cause the model to overfit)
- by using synthetic data creation such as:
- synthetic minority oversampling technique (SMOTE)
- use the existing data in the smaller set to create new data points that look like the existing ones: use the feature vectors of the minority class to generate synthetic data points that lie between real data points and their k-nearest neighbours
- adaptive synthetic sampling method (ADASYN)
- like SMOTE, but generates more synthetic points for minority samples that are harder to learn (those surrounded by more majority-class neighbours)
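The SMOTE idea above can be sketched in plain NumPy: interpolate between a minority point and one of its k nearest minority neighbours. This is a minimal illustration, not the full algorithm (production code would normally use a library such as `imbalanced-learn`; the toy minority cluster is an assumption):

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: interpolate between minority points and their k-NN."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(minority)
    # pairwise distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest minority neighbours
    new_points = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a random minority point...
        j = neighbours[i, rng.integers(k)]      # ...and one of its neighbours
        u = rng.random()                        # interpolation factor in [0, 1)
        new_points.append(minority[i] + u * (minority[j] - minority[i]))
    return np.array(new_points)

rng = np.random.default_rng(0)
minority = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
synthetic = smote_sample(minority, n_new=80, k=5, rng=rng)
```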
- Algorithm-level methods
- keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
- use ensemble learning methods, since each model in the ensemble can be trained on a different subset of the data
- cost-sensitive learning
- class-balanced loss
- focal loss
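In practice, cost-sensitive learning is often just a class-weight argument. A sketch using scikit-learn's `class_weight='balanced'`, which weights each class inversely to its frequency (the synthetic 95:5 data is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 95:5 imbalance in one overlapping feature
X_maj = rng.normal(-1.0, 1.0, size=(950, 1))
X_min = rng.normal(+1.0, 1.0, size=(50, 1))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
# 'balanced' weights errors on the rare class ~19x more here,
# pushing the decision boundary back toward the minority class
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

minority_recall_plain = (plain.predict(X_min) == 1).mean()
minority_recall_weighted = (weighted.predict(X_min) == 1).mean()
```

The weighted model trades some majority-class accuracy for much higher minority recall.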
Use metrics to measure imbalance
- Class imbalance: the imbalance in the number of members between different facet values
- Difference in proportions of labels (DPL): the imbalance of positive outcomes between different facet values
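Both metrics follow directly from their definitions: CI = (n_a − n_d)/(n_a + n_d) for two facet values, and DPL = q_a − q_d where q is the proportion of positive labels within each facet. A sketch (the facet values and labels below are made up):

```python
import numpy as np

def class_imbalance(facet, value_a, value_d):
    """CI = (n_a - n_d) / (n_a + n_d): how unevenly two facet values are represented."""
    n_a = np.sum(facet == value_a)
    n_d = np.sum(facet == value_d)
    return (n_a - n_d) / (n_a + n_d)

def dpl(labels, facet, value_a, value_d, positive=1):
    """DPL = q_a - q_d: difference in positive-label proportions between facet values."""
    q_a = np.mean(labels[facet == value_a] == positive)
    q_d = np.mean(labels[facet == value_d] == positive)
    return q_a - q_d

facet = np.array(["m"] * 80 + ["f"] * 20)
labels = np.array([1] * 60 + [0] * 20 + [1] * 5 + [0] * 15)
ci = class_imbalance(facet, "m", "f")  # (80 - 20) / 100 = 0.6
gap = dpl(labels, facet, "m", "f")     # 0.75 - 0.25 = 0.5
```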
Data Labeling & Label Quality
- correctly labeled datasets are often called "ground truth"
- efficient data labeling
- access to additional human workforces: Machine Learning Systems Design#Human-in-the-Loop Pipelines
- automated data labeling capabilities
- assistive labeling features
- label multiplicity/ambiguity
- multiple annotators/data sources cause conflicting labels
- solutions: majority vote, soft labels, probabilistic labels, annotator modeling
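The first two resolution strategies can be sketched in a few lines of standard-library Python (the class names are illustrative):

```python
from collections import Counter

def majority_vote(annotations):
    """Hard label: the most common annotation wins (ties broken arbitrarily)."""
    return Counter(annotations).most_common(1)[0][0]

def soft_label(annotations, classes):
    """Soft label: the empirical distribution over classes, usable as a training target."""
    counts = Counter(annotations)
    total = len(annotations)
    return {c: counts.get(c, 0) / total for c in classes}

votes = ["cat", "cat", "dog"]          # three annotators disagree
hard = majority_vote(votes)            # "cat"
soft = soft_label(votes, ["cat", "dog"])  # {"cat": 2/3, "dog": 1/3}
```

Soft labels keep the disagreement signal that majority voting throws away, which can help when ambiguity is genuine rather than annotator error.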
Feature Generalization
Always consider two aspects with regard to generalization:
- feature coverage
- the percentage of samples that have values for this feature in the data -> the fewer missing values, the higher the coverage
- the distribution of feature values
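Feature coverage is just the per-column fraction of non-missing values; a pandas sketch (the toy frame is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, None, 41, None, 52],
    "income": [50_000, None, None, None, 72_000, 61_000],
})

# fraction of rows with a value, per feature
coverage = df.notna().mean()  # age: 4/6, income: 3/6
```

A feature with low coverage in training data, or whose coverage differs sharply between training and serving, is a candidate for removal or imputation.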
Model Selection
Always consider:
- Interpretability
- Complexity
- Generalization ability
- Operational constraints (latency, cost, serving environment)